AITopics | end-to-end speech-to-text translation

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Neural Information Processing Systems, Dec-26-2025, 14:50:56 GMT

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pre-trained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
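As a rough illustration of what training several objectives "simultaneously ... in a multi-task learning manner" can look like, the sketch below combines per-task losses into one scalar per step as a weighted sum. This is a generic formulation, not ComSL's actual objective: the task names, loss values, and weights are hypothetical and do not come from the paper.

```python
from typing import Dict


def combine_task_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted sum of per-task losses for one training step."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical task names, loss values, and weights (illustration only).
step_losses = {
    "speech_translation": 2.0,
    "speech_recognition": 1.0,
    "cross_modality": 0.5,
}
step_weights = {
    "speech_translation": 1.0,
    "speech_recognition": 0.5,
    "cross_modality": 0.5,
}

total_loss = combine_task_losses(step_losses, step_weights)
print(total_loss)
```

In a real training loop the values would be framework tensors rather than floats, and the weights would be tuned hyperparameters.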

composite speech-language model, end-to-end speech-to-text translation, name change

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.93)

When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation

Anna Min, Chenxu Hu, Yi Ren, Hang Zhao

arXiv.org Artificial Intelligence, Feb-1-2025

Abstract: Though end-to-end speech-to-text translation has been a great success, we argue that the cascaded speech-to-text translation model, which is usually criticized for error propagation between its automatic speech recognition (ASR) and machine translation (MT) models, still has its place. In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that cascading errors stem primarily from the increased divergence between similar samples in the speech domain when they are mapped to the text domain. By including multiple candidates and self-supervised speech features, our approach allows the machine translation model to choose the right words and ensure precise translation using various speech samples. This strategy minimizes error propagation and takes advantage of large ASR and MT datasets, along with pre-trained ASR/MT models, while addressing associated issues.

In recent years, the academic community has been intrigued by the rapid advancement of end-to-end speech-to-text translation ... Recent studies [18], [19] have demonstrated the performance improvements achieved by scaling up pre-trained models for downstream natural language processing tasks.
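One simple way to picture "incorporating multiple candidates from ASR ... into MT" is to join the top-scoring ASR hypotheses into a single source string for the translation model. The sketch below does only that; it is an assumption for illustration, not the method from the paper. The separator token, the function name `build_mt_input`, and the example hypotheses with their scores are all hypothetical, and the self-supervised speech features mentioned in the abstract are not modeled here.

```python
from typing import List, Tuple

SEP = " <sep> "  # hypothetical separator token between candidates


def build_mt_input(nbest: List[Tuple[str, float]], k: int = 3) -> str:
    """Join the k highest-scoring distinct ASR hypotheses into one MT source string."""
    # Sort by score, best (highest log-probability) first.
    ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
    unique: List[str] = []
    for text, _score in ranked:
        if len(unique) >= k:
            break
        text = text.strip()
        # Skip empty strings and exact duplicates of better-ranked candidates.
        if text and text not in unique:
            unique.append(text)
    return SEP.join(unique)


# Hypothetical n-best list: (transcript, log-probability) pairs.
hypotheses = [
    ("i scream for ice cream", -1.2),
    ("ice cream for ice cream", -2.5),
    ("i scream for ice cream", -3.0),
    ("eye scream for ice cream", -4.1),
]

mt_source = build_mt_input(hypotheses, k=3)
print(mt_source)
```

Passing several candidates gives the translation model a chance to recover when the single best transcript contains a recognition error, which is the intuition the abstract describes.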

artificial intelligence, natural language, translation

arXiv:2502.00377

Country: